Seurat Single Cell Omics Integration

Seurat is a tool developed by the lab of Rahul Satija to facilitate analysis of Single Cell Omics (scOmics) data. Started as a pipeline tool, i.e. a collection of all known pre-processing and scOmics analysis steps, Seurat developed a few original methods on normalization and batch-effect correction. With the rise of Human Cell Atlas (HCA) consortium, Seurat contributed to ideas on data harmonization across multiple labs and technologies. The ambition of HCA was to develop an Atlas of all human cells from all human tissues, that could serve as a refernce for the future research, i.e. human cells from a particular experiment could have been quickly assigned to a particular cell type without performing the regular single cell analysis steps.

Seurat was for a very long time de-facto a standard tool for single cell data analysis in 2016-2018, especially in North America, before first two Seurat articles got published in 2018 and 2019. Nowadays, there are alternative single cell analysis workflows, such as SCRAN (lab of John Marioni at EBI) and SCANPY (lab of Fabian Thejs in Munich), that can compete with Seurat. In the first paper of A. Butler et al., Integrating cingle cell transcriptomic data across different conditions, technilogies and species in Nature Biotechnology, 2018, Seurat suggested an interesting modification of the Canonical Correlation Analysis (CCA), that belongs to the same family of algorithms as PLS, OPLS, JIVE and DISCO. The modification was to include an alignment of canocical correlation vectors (PLS-components) with the Dynamic Time Warping (DTW), which typically serves as a trajectory similarity measure in Time Series data analysis.

The idea of data integration by Seurat is that CCA delivers components representing linear combinations of features across data sets that are maximally correlated (capture correlation structures across data sets), but not necessarily aligned. Next, Dynamic Time Warping (DTW) is used to locally compress or stretch the vectors during alignment to correct for changes in population density. As a result, the data sets are represented in a single, integrated low-dimensional space.

The CCA + DTW strategy was successfully used for data harmonization across multiple conditions, species, labs and technilogies. However, those examples were a sort of single cell oriented batch-effect correction that has been known as a big problem in data analysis for years. In other words, despite CCA + DTW offer an impressive integrative framework, possibly adjusted for single cell data analysis, this methodology is for integration across samples and not straghtforward to extend to the integration across Omics, which is the main chellange and focuse of multi-Omics data integration.

Seurat for Integrating scRNAseq and scATACseq from PBMC Cells

First, we load in the provided peak matrix and collapse the peak matrix to a “gene activity matrix”. Here, we make the simplifying assumption that a gene’s activity can be quantified by simply summing all counts within the gene body + 2kb upstream. Next we build the Seurat object and store the original peaks as “ATAC” assay. As a QC step, we also filter out all cells here with fewer than 10K total counts in the scATAC-seq data.

library("Seurat")
## Registered S3 method overwritten by 'R.oo':
##   method        from       
##   throw.default R.methodsS3
library("ggplot2")
## RStudio Community is a great place to get help:
## https://community.rstudio.com/c/tidyverse.
#Gene activity quantification
peaks <- Read10X_h5("atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
activity.matrix <- read.delim("scatacseq_activity_matrix.txt", header = TRUE, row.names = 1, sep="\t")
colnames(activity.matrix)<-gsub("\\.","-",colnames(activity.matrix))
activity.matrix[1:5,1:5]
##                AAACGAAAGAGCGAAA-1 AAACGAAAGAGTTTGA-1 AAACGAAAGCGAGCTA-1
## RP5-857K21.4                    0                  0                  0
## RP11-206L10.2                   0                  0                  2
## RP11-206L10.10                  0                  0                  0
## LINC01128                       0                  0                  4
## FAM41C                          0                  0                  0
##                AAACGAAAGGCTTCGC-1 AAACGAAAGTGCTGAG-1
## RP5-857K21.4                    0                  0
## RP11-206L10.2                   8                  0
## RP11-206L10.10                  0                  0
## LINC01128                       3                  0
## FAM41C                          4                  0
#Object setup
pbmc.atac <- CreateSeuratObject(counts = peaks, assay = "ATAC", project = "10x_ATAC")
pbmc.atac@assays$ATAC@counts[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##                    AAACGAAAGAGCGAAA-1 AAACGAAAGAGTTTGA-1
## chr1:565107-565550                  .                  .
## chr1:569174-569639                  .                  .
## chr1:713460-714823                  .                  .
## chr1:752422-753038                  .                  .
## chr1:762106-763359                  .                  .
##                    AAACGAAAGCGAGCTA-1 AAACGAAAGGCTTCGC-1
## chr1:565107-565550                  .                  .
## chr1:569174-569639                  .                  .
## chr1:713460-714823                  2                  8
## chr1:752422-753038                  .                  .
## chr1:762106-763359                  4                  2
##                    AAACGAAAGTGCTGAG-1
## chr1:565107-565550                  .
## chr1:569174-569639                  .
## chr1:713460-714823                  .
## chr1:752422-753038                  .
## chr1:762106-763359                  .
pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
pbmc.atac
## An object of class Seurat 
## 108765 features across 8728 samples within 2 assays 
## Active assay: ATAC (89796 features)
##  1 other assay present: ACTIVITY
meta <- read.table("atac_v1_pbmc_10k_singlecell.csv", sep = ",", header = TRUE, row.names = 1, stringsAsFactors = FALSE)
meta <- meta[colnames(pbmc.atac), ]
pbmc.atac <- AddMetaData(pbmc.atac, metadata = meta)
pbmc.atac <- subset(pbmc.atac, subset = nCount_ATAC > 10000)
pbmc.atac$tech <- "atac"

Next we filter the individual data sets and visualize the individual scRNAseq and scATACseq data sets. Here we perform latent semantic indexing to reduce the dimensionality of the scATAC-seq data down to 50 dimensions. This procedure learns an ‘internal’ structure for the scRNA-seq data, and is important when determining the appropriate weights for the anchors when transferring information. We utilize Latent Semantic Indexing (LSI) to learn the structure of ATAC-seq data, as proposed in Cusanovich et al, Science 2015. We also include a pre-processed scRNAseq PBMC cells data set that was used in many other Seurat vignettes as a benchmark data set.

#Data preprocessing
DefaultAssay(pbmc.atac) <- "ACTIVITY"
pbmc.atac <- FindVariableFeatures(pbmc.atac)
pbmc.atac <- NormalizeData(pbmc.atac)
pbmc.atac <- ScaleData(pbmc.atac)
## Centering and scaling data matrix
DefaultAssay(pbmc.atac) <- "ATAC"
VariableFeatures(pbmc.atac) <- names(which(Matrix::rowSums(pbmc.atac) > 1000))
pbmc.atac <- RunLSI(pbmc.atac, n = 50, scale.max = NULL)
## Performing TF-IDF normalization
## Running SVD on TF-IDF matrix
## Scaling cell embeddings
pbmc.atac <- RunTSNE(pbmc.atac, reduction = "lsi", dims = 1:50)

pbmc.rna <- readRDS("pbmc_10k_v3.rds")
pbmc.rna@assays$RNA@counts[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##            rna_AAACCCAAGCGCCCAT-1 rna_AAACCCACAGAGTTGG-1
## AL627309.1                      .                      .
## AL669831.5                      .                      .
## FAM87B                          .                      .
## LINC00115                       .                      .
## FAM41C                          .                      .
##            rna_AAACCCACAGGTATGG-1 rna_AAACCCACATAGTCAC-1
## AL627309.1                      .                      .
## AL669831.5                      .                      .
## FAM87B                          .                      .
## LINC00115                       .                      .
## FAM41C                          .                      .
##            rna_AAACCCACATCCAATG-1
## AL627309.1                      .
## AL669831.5                      .
## FAM87B                          .
## LINC00115                       .
## FAM41C                          .
pbmc.rna$tech <- "rna"
p1 <- DimPlot(pbmc.atac, reduction = "tsne") + NoLegend() + ggtitle("scATAC-seq")
p2 <- DimPlot(pbmc.rna, reduction = "tsne", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("scRNA-seq")
## Warning: Using `as.character()` on a quosure is deprecated as of rlang 0.3.0.
## Please use `as_label()` or `as_name()` instead.
## This warning is displayed once per session.
CombinePlots(plots = list(p1, p2))

Next we will apply the transferring anchors Seurat algorithm and visualize the individual scRNAseq and scATACseq data sets after they have been harmonized.

#Transfer anchors
gc()
##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   2498122  133.5    4683997  250.2   4683997  250.2
## Vcells 503133347 3838.7  902747541 6887.5 741981315 5660.9
transfer.anchors <- FindTransferAnchors(reference = pbmc.rna, query = pbmc.atac, features = VariableFeatures(object = pbmc.rna), 
                                        reference.assay = "RNA", query.assay = "ACTIVITY", reduction = "cca")
## Warning in RunCCA.Seurat(object1 = reference, object2 = query, features =
## features, : Running CCA on different assays
## Running CCA
## Merging objects
## Finding neighborhoods
## Finding anchors
##  Found 16519 anchors
## Filtering anchors
##  Retained 3763 anchors
## Extracting within-dataset neighbors
celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = pbmc.rna$celltype, 
                                     weight.reduction = pbmc.atac[["lsi"]])
## Finding integration vectors
## Finding integration vector weights
## Predicting cell labels
pbmc.atac <- AddMetaData(pbmc.atac, metadata = celltype.predictions)

hist(pbmc.atac$prediction.score.max)
abline(v = 0.5, col = "red")

table(pbmc.atac$prediction.score.max > 0.5)
## 
## FALSE  TRUE 
##   364  5516
#Visualizing individual scRNAseq and scATACseq after their alignment
pbmc.atac.filtered <- subset(pbmc.atac, subset = prediction.score.max > 0.5)
pbmc.atac.filtered$predicted.id <- factor(pbmc.atac.filtered$predicted.id, levels = levels(pbmc.rna))  # to make the colors match
p1 <- DimPlot(pbmc.atac.filtered, reduction = "tsne", group.by = "predicted.id", label = TRUE, repel = TRUE) + 
  ggtitle("scATAC-seq cells") + NoLegend() + scale_colour_hue(drop = FALSE)
p2 <- DimPlot(pbmc.rna, reduction = "tsne", group.by = "celltype", label = TRUE, repel = TRUE) + ggtitle("scRNA-seq cells") + 
  NoLegend()
CombinePlots(plots = list(p1, p2))

Finally we will perform co-embedding and tSNE visualization of the scRNAseq and scATACseq Omics in their common space after the integration has been done. They demonstrate very encouraging overlapping. Here, we use the same anchors used earlier to transfer cell type labels to impute RNA-seq values for the scATAC-seq cells. We then merge the measured and imputed scRNA-seq data and run a standard tSNE analysis to visualize all the cells together.

#Co-embedding
gc()
##              used   (Mb) gc trigger    (Mb)   max used   (Mb)
## Ncells    2545822  136.0    4683997   250.2    4683997  250.2
## Vcells 1094514663 8350.5 1872366738 14285.1 1217498637 9288.8
genes.use <- VariableFeatures(pbmc.rna)
refdata <- GetAssayData(pbmc.rna, assay = "RNA", slot = "data")[genes.use, ]

# refdata (input) contains a scRNA-seq expression matrix for the scRNA-seq cells.  imputation
# (output) will contain an imputed scRNA-seq matrix for each of the ATAC cells
imputation <- TransferData(anchorset = transfer.anchors, refdata = refdata, weight.reduction = pbmc.atac[["lsi"]])
## Finding integration vectors
## Finding integration vector weights
## Transfering 3000 features onto reference data
# this line adds the imputed data matrix to the pbmc.atac object
pbmc.atac[["RNA"]] <- imputation
coembed <- merge(x = pbmc.rna, y = pbmc.atac)

# Finally, we run PCA and UMAP on this combined object, to visualize the co-embedding of both
# datasets
coembed <- ScaleData(coembed, features = genes.use, do.scale = FALSE)
## Centering data matrix
coembed <- RunPCA(coembed, features = genes.use, verbose = FALSE)
coembed <- RunTSNE(coembed, dims = 1:30)
coembed$celltype <- ifelse(!is.na(coembed$celltype), coembed$celltype, coembed$predicted.id)

p1 <- DimPlot(coembed, group.by = "tech", reduction = "tsne")
p2 <- DimPlot(coembed, group.by = "celltype", label = TRUE, repel = TRUE, reduction = "tsne") + NoLegend()
CombinePlots(plots = list(p1, p2))
## Warning: Graphs cannot be vertically aligned unless the axis parameter is
## set. Placing graphs unaligned.